Text Classification and Multilinguism: Getting at Words via N-grams of Characters

Authors

  • Ismaïl Biskri
  • Sylvain Delisle
Abstract

Genuine numerical multilingual text classification is almost impossible if only words are treated as the privileged unit of information. Although text tokenization (in which words are considered as tokens) is relatively easy in English or French, it is much more difficult for other languages such as German or Arabic. Moreover, stemming, typically used to normalize and reduce the size of the lexicon, constitutes another challenge. The notion of N-grams of words (i.e. sequences of N words, with N typically equal to 2, 3, or 4), which over the last ten years seems to have produced good results in both language identification and speech analysis, has recently become a privileged research axis in several areas of knowledge extraction from text. In this paper, we present text classification software based on N-grams of characters (not words), evaluate its results on documents containing text written in English and French, and compare these results with those obtained from a different classification tool based exclusively on the processing of words. An interesting feature of our software is that it does not need to perform any language-specific processing and is thus appropriate for multilingual text classification.
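To make the idea concrete, here is a minimal Python sketch (not the authors' actual software) of character-trigram classification: each class is represented by an n-gram frequency profile, and a new document is assigned to the closest profile by cosine similarity. The class labels and toy training strings below are invented purely for illustration.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Build a character n-gram frequency profile by sliding a window of n
    characters over the text; no tokenization or stemming is needed, which
    keeps this step language-independent."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(freq * b.get(gram, 0) for gram, freq in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical class profiles built from toy English and French training text.
profiles = {
    "english": char_ngrams("the classification of documents written in English"),
    "french": char_ngrams("la classification de documents rédigés en français"),
}

new_doc = char_ngrams("une classification numérique de textes multilingues")
best = max(profiles, key=lambda label: cosine(new_doc, profiles[label]))
print(best)  # the class whose trigram profile is closest to the new document
```

Because the profiles are built directly from raw character sequences, the same sketch applies unchanged to text in English, French, German, or Arabic.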


Similar Articles

Improving K-Nearest Neighbor Efficacy for Farsi Text Classification

One of the common processes in the field of text mining is text classification. Because of the complex nature of the Farsi language, with words written in separate parts and compound verbs, most text classification systems are not applicable to Farsi texts. K-Nearest Neighbors (KNN) is one of the most widely used methods for text classification and shows good performance in experiments on different d...
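For readers unfamiliar with the method, the sketch below shows generic KNN classification over bag-of-terms vectors in Python; it is not the cited system (whose Farsi-specific features are not described here), and the toy documents and labels are invented.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-terms frequency dictionaries."""
    dot = sum(freq * b.get(term, 0) for term, freq in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(query, training, k=3):
    """Generic K-Nearest Neighbors for text: rank the labelled training
    documents by similarity to the query and take a majority vote among
    the top k. `training` is a list of (vector, label) pairs."""
    ranked = sorted(training, key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy usage with word-count vectors (any feature vectors could be substituted).
training = [
    (Counter("economic growth market".split()), "economy"),
    (Counter("market prices inflation".split()), "economy"),
    (Counter("football match goal".split()), "sport"),
]
print(knn_classify(Counter("market inflation rates".split()), training, k=2))  # -> economy
```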


Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?

This article offers an empirical study on the different ways of encoding Chinese, Japanese, Korean (CJK) and English languages for text classification. Different encoding levels are studied, including UTF-8 bytes, characters, words, romanized characters and romanized words. For all encoding levels, whenever applicable, we provide comparisons with linear models, fastText (Joulin et al., 2016) an...


A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese

In the process of establishing information theory, C. E. Shannon proposed the Markov process as a good model for characterizing a natural language. The core of this idea is to calculate the frequencies of strings composed of n characters (n-grams), but this statistical analysis of large text data for a large n has never been carried out because of the memory limitations of computers and the ...


Experimenting N-Grams in Text Categorization

This paper deals with the automatic supervised classification of documents. The suggested approach is based on a vector representation of the documents centred not on the words but on n-grams of characters for varying n. The effects of this method are examined in several experiments using the multivariate chi-square to reduce the dimensionality, the cosine and Kullback-Leibler distances, and tw...
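As a rough Python illustration of this representation and one of the distances mentioned (the chi-square dimensionality reduction and the paper's exact weighting are omitted), character n-gram profiles can be normalized into relative-frequency distributions and compared with the Kullback-Leibler divergence:

```python
import math
from collections import Counter

def ngram_distribution(text, n=3):
    """Relative frequencies of character n-grams: a simple version of the
    vector representation referred to above (no chi-square feature
    selection is applied in this sketch)."""
    text = text.lower()
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values()) or 1  # guard against empty input
    return {gram: c / total for gram, c in counts.items()}

def kl_divergence(p, q, eps=1e-9):
    """Kullback-Leibler divergence D(p || q); eps stands in for n-grams that
    never occur in q, so the logarithm stays defined."""
    return sum(pv * math.log(pv / q.get(gram, eps)) for gram, pv in p.items())
```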


Syntactic Dependency-Based N-grams as Classification Features

In this paper we introduce the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in how neighboring elements are determined: in the case of sn-grams, neighbors are obtained by following syntactic relations in syntactic trees, rather than by taking the words as they appear in the text. Dependency trees fit directly into this idea, while in the case of constituency...
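A minimal Python sketch of the idea, assuming a dependency tree stored as a head-to-dependents mapping: sn-grams are word sequences that follow syntactic arcs rather than linear word order. The toy sentence and tree below are invented, and only the simplest (continuous head-to-dependent path) case is covered.

```python
def sn_grams(tree, head, n):
    """Enumerate syntactic n-grams: paths of n words obtained by following
    head -> dependent arcs in a dependency tree, instead of taking words
    in the order they appear in the text."""
    if n == 1:
        return [[head]]
    paths = []
    for child in tree.get(head, []):
        for tail in sn_grams(tree, child, n - 1):
            paths.append([head] + tail)
    return paths

# Hypothetical dependency tree for "the cat chased a mouse" (head -> dependents).
tree = {"chased": ["cat", "mouse"], "cat": ["the"], "mouse": ["a"]}
print(sn_grams(tree, "chased", 3))
# -> [['chased', 'cat', 'the'], ['chased', 'mouse', 'a']]
```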



Journal:

Volume   Issue

Pages  -

Publication date: 2002